1. INTRODUCTION
Many problems in machine learning lead to non-convex optimization. In this paper, we focus on optimization problems of the following form:


$$x^* = \arg\min_{x \in \Omega} f(x) \qquad (1)$$


where the objective function $f(x)$ is smooth (possibly non-convex) on the compact domain $\Omega$. If $f(x)$ is convex, problem (1) can be solved efficiently with standard convex optimization methods such as gradient descent (GD), Newton's method, or stochastic GD [1]. In practice, however, non-convex models have several advantages over convex ones. For example, deep neural networks, which are widely used in computer vision and data mining, lead to highly non-convex optimization problems. In these cases, solving problem (1) is more difficult than in the convex case because a non-convex objective usually has a multimodal structure, and common convex optimization methods may get trapped in poor local optima. In this paper, we propose a new optimization method for solving problem (1) when the objective function $f(x)$ is non-convex and smooth.

A growing body of research addresses the non-convex problem (1), for example stochastic variance reduced gradient (SVRG) [2] and proximal SVRG (Prox-SVRG) [3]. SVRG and Prox-SVRG can be used to solve non-convex finite-sum problems, but they may not converge to the global optimum for non-convex functions. The concave-convex procedure (CCCP) [4] is widely used for non-convex problems. It decomposes the non-convex objective into the sum of a convex function and a concave one, and then linearizes the concave function. However, a single iteration of CCCP is much more expensive than one of SGD, because CCCP solves a quadratic program at each iteration. The graduated optimization algorithm (GOA) [5] is also a popular global search algorithm for non-convex problems, but directly calculating its gradient is costly. In [6], the authors propose a new optimization method, namely GradOpt. Experimental results in [6] show that GradOpt can quickly yield a much better solution than mini-batch SGD. However, [7], which proposes two algorithms (SVRG-GOA and PSVRG-GOA) for solving non-convex problems, shows that GradOpt has some shortcomings: it converges slowly, partly due to its decreasing step size, and its range of application is limited. The value of the objective function may also become trapped around a value larger than the global minimum because the smoothing parameter shrinks only slightly after several iterations. Many well-known stochastic optimization algorithms, such as Adagrad [8], RMSProp [9], Adam [10], Adadelta [11], RSAG [12], Natasha2 [13], and NEON2 [14], have also been proposed for optimization problems in machine learning. The big challenges for non-convex optimization algorithms in machine learning are: Can a local or global optimum be found? Is it possible to get rid of saddle points? How can saddle points be escaped efficiently? Can the optimum be found in acceptable time on large data? Finding an optimum of a non-convex optimization problem is NP-hard in the worst case [15]. Despite this intractability result, non-convex optimization is the main algorithmic technique behind many state-of-the-art machine learning and deep learning results. In light of this background, we state the main contributions of our paper:


a Using the Bernoulli distribution and two stochastic approximation sequences, we develop GS-OPT for solving a wide class of non-convex problems, and we show that it usually performs better than previous algorithms.

b Applying GS-OPT to the posterior inference problem in topic models, we obtain two learning methods: ML-GSOPT and Online-GSOPT. In addition, GS-OPT is flexible, so it can be adapted to solve many non-convex models in machine learning.


Organization: This paper is structured as follows. In Section 2, we propose a new algorithm for solving the non-convex optimization problem. In Section 3, we apply GS-OPT to posterior inference in latent Dirichlet allocation (LDA) and design two methods for learning LDA. In Section 4, we report experimental results on two large datasets: New York Times and Pubmed. Finally, we conclude the paper in Section 5.

Notation: Throughout the paper, we use the following conventions and notation. Bold faces denote vectors or matrices. $x_i$ denotes the $i$-th element of vector $x$, and $A_{ij}$ denotes the element at row $i$ and column $j$ of matrix $A$. The unit simplex in the $n$-dimensional Euclidean space is denoted as $\Delta_n = \{x \in \mathbb{R}^n : x \ge 0, \sum_{k=1}^{n} x_k = 1\}$, and its interior is denoted as $\overline{\Delta}_n$. We work with text collections of dimension $V$ (the dictionary size). Each document $d$ is represented as a frequency vector $d = (d_1, \ldots, d_V)^T$, where $d_j$ is the frequency of term $j$ in $d$. Denote by $n_d$ the length of $d$, i.e., $n_d = \sum_j d_j$. The inner product of vectors $u$ and $v$ is denoted as $\langle u, v \rangle$. $I(x)$ is the indicator function, which returns 1 if $x$ is true and 0 otherwise.


2. PROPOSED STOCHASTIC OPTIMIZATION ALGORITHM 
We consider optimization problems of the form:


$$x^* = \arg\min_{x \in \Omega} \left[ f(x) = g(x) + h(x) \right] \qquad (2)$$


where the non-convex objective function $f(x)$ consists of two components $g(x)$ and $h(x)$. Numerous models in machine learning fall within the framework of problem (2). For example, in Bayesian learning, we usually have to solve the maximum a posteriori (MAP) estimation problem:


$$x^* = \arg\max_{x} \left[ \log P(D \mid x) + \log P(x) \right] \qquad (3)$$


where $P(D \mid x)$ denotes the likelihood of an observed variable $D$ and $P(x)$ denotes the prior of the hidden variable $x$. Problem (3) can be rewritten as:

$$x^* = \arg\min_{x} \left[ -\log P(D \mid x) - \log P(x) \right] \qquad (4)$$


Problem (4) is an instance of problem (2) with $g(x) = -\log P(D \mid x)$ and $h(x) = -\log P(x)$. To solve problem (3), OPE [17] was proposed by changing the learning rate in the OFW algorithm [16] and carefully analyzing the theoretical aspects; it solves the MAP estimation problem in many probabilistic models. Compared with CCCP [4] and SMM [18], OPE has several preferable properties. First, the convergence rate of CCCP and SMM is unknown for non-convex problems. Second, each iteration of SMM requires solving a convex problem, and each iteration of CCCP requires solving a non-linear equation system, which is expensive and non-trivial in many cases, whereas each iteration of OPE requires solving only a linear program, which is significantly easier than a non-linear problem. Therefore, OPE promises to be much more efficient than CCCP and SMM. The convergence rate of OPE is also significantly faster than that of PMD [19] and HAMCMC [20].

In this section, we examine further important characteristics of OPE, some of which were investigated in [17]. In general, optimization theory has encountered many difficulties in solving non-convex optimization problems, and many methods are good in theory but inapplicable in practice. Therefore, instead of directly solving the non-convex problem with the true objective function $f(x)$, OPE constructs a sequence of stochastic functions $F_t(x)$ that approximates the objective function of interest by choosing uniformly from $\{g(x), h(x)\}$ at each iteration $t$. It is guaranteed that $F_t$ converges to $f$ as $t \to \infty$. OPE is a stochastic optimization algorithm. It is straightforward to implement, computationally efficient, and suitable for problems that are large in terms of data and/or parameters. The effectiveness of OPE for posterior inference in the latent Dirichlet allocation model has been shown both experimentally and theoretically in [17]. The main idea of OPE is to construct a stochastic sequence $F_t(x)$ that approximates $f(x)$ by using the uniform distribution, so that (2) becomes easy to solve. Although OPE outperforms earlier methods, we want to explore a new stochastic optimization algorithm that solves problem (2) more efficiently. We identify two limitations of OPE. First, the uniform distribution is too simple, so it is not suitable for many problems. Second, using a single approximation function in place of the true objective is less effective than using two approximation bounds.

Having identified these drawbacks of OPE, we make several improvements to obtain a new algorithm, GS-OPT. Intuitively, two stochastic sequences approximating the objective function $f(x)$ are better than one. So, using the Bernoulli distribution, we construct two sequences that both converge to $f(x)$: one begins with $g(x)$ and is called the sequence $\{L_t\}$, the other begins with $h(x)$ and is called the sequence $\{U_t\}$. We rescale $g$ and $h$ according to the Bernoulli parameter $p \in (0, 1)$: $G(x) := g(x)/p$, $H(x) := h(x)/(1 - p)$. We set $f_1^l := G(x)$ and, for $t = 2, 3, \ldots$, pick $f_t^l$ as a Bernoulli variable with $P(f_t^l = G(x)) = p$ and $P(f_t^l = H(x)) = 1 - p$. Then we set $L_t := \frac{1}{t} \sum_{h=1}^{t} f_h^l$. Similarly, we set $f_1^u := H(x)$ and, for $t = 2, 3, \ldots$, pick $f_t^u$ from $\{G(x), H(x)\}$ according to a Bernoulli distribution with $P(f_t^u = G(x)) = p$ and $P(f_t^u = H(x)) = 1 - p$. Then we set $U_t := \frac{1}{t} \sum_{h=1}^{t} f_h^u$. Each pick has expectation $pG + (1 - p)H = g + h = f$, so by construction both sequences $\{L_t\}$ and $\{U_t\}$ converge to $f$ as $t \to \infty$. Using both stochastic sequences at each iteration gives us more information about the objective function $f(x)$, so that we have more chances to reach a minimum of $f(x)$. We approximate the true objective function $f(x)$ by $F_t(x)$, a linear combination of $U_t$ and $L_t$ with a suitable parameter $\nu \in (0, 1)$:


$$F_t(x) := \nu U_t(x) + (1 - \nu) L_t(x)$$
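
The construction of $L_t$, $U_t$ and $F_t$ can be illustrated with a minimal Python sketch. It relies on the observation that $L_t$ and $U_t$ depend only on how many times $G$ was picked, so only two counters need to be stored. The function name `build_sequences`, the `seed` argument, and the scalar toy functions in the usage note are assumptions made for illustration; they do not come from the paper.

```python
import random


def build_sequences(g, h, p, nu, T, seed=0):
    """Return callables (L_T, U_T, F_T) built from Bernoulli picks of G and H.

    g, h : callables for the two components of f = g + h
    p    : Bernoulli parameter in (0, 1)
    nu   : linear combination parameter in (0, 1)
    T    : number of picks (t = 1, ..., T)
    """
    rng = random.Random(seed)
    # Number of times G has been picked in each sequence.
    # The L sequence starts with G (f_1^l = G), the U sequence with H (f_1^u = H).
    a_l, a_u = 1, 0
    for _ in range(2, T + 1):
        a_l += int(rng.random() < p)  # pick f_t^l: G with probability p
        a_u += int(rng.random() < p)  # pick f_t^u: G with probability p

    def mix(a):
        # Average of a copies of G = g/p and (T - a) copies of H = h/(1-p).
        return lambda x: (a * g(x) / p + (T - a) * h(x) / (1 - p)) / T

    L, U = mix(a_l), mix(a_u)
    F = lambda x: nu * U(x) + (1 - nu) * L(x)
    return L, U, F
```

For example, with the toy choices `g = lambda x: x * x` and `h = math.cos`, calling `build_sequences(g, h, p=0.3, nu=0.3, T=200000)` gives an `F` whose value at `x = 1.0` is close to `g(1.0) + h(1.0)`, while `T = 1` returns exactly `L = G` and `U = H`.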


Using both bounds stochastically helps reduce the possibility of getting stuck at a local stationary point, and is an efficient approach for escaping saddle points in non-convex optimization. Our new variant therefore seems more appropriate than OPE. Although GS-OPT also aims at increasing randomness, it works differently from OPE. While OPE constructs only one sequence of functions $F_t$, GS-OPT constructs three sequences $\{L_t\}$, $\{U_t\}$ and $\{F_t\}$, in which $\{F_t\}$ depends on $\{U_t\}$ and $\{L_t\}$, so the structure of the main sequence $\{F_t\}$ is changed. Details of GS-OPT are presented in Algorithm 1. The uniform distribution is the special case of the Bernoulli distribution with parameter $p = 0.5$, so OPE is less flexible across datasets, whereas GS-OPT adapts well to different datasets, as our experiments show. In the rest of this section, we show that GS-OPT preserves the key advantage of OPE, namely the guarantee on quality and convergence rate.
Algorithm 1 GS-OPT: A new General Stochastic OPTimization algorithm for solving the non-convex problem


Input: Bernoulli parameter $p \in (0, 1)$ and linear combination parameter $\nu \in (0, 1)$
Output: $x^*$ that minimizes $f(x) = g(x) + h(x)$ on $\Omega$.
1: Initialize $x_1$ arbitrarily in $\Omega$
2: Set $G(x) := g(x)/p$, $H(x) := h(x)/(1 - p)$
3: $f_1^l := G(x)$, $f_1^u := H(x)$
4: for $t = 2, 3, \ldots, \infty$ do
5:    Pick $f_t^l$ as a Bernoulli variable where $P(f_t^l = G(x)) = p$, $P(f_t^l = H(x)) = 1 - p$
6:    $L_t := \frac{1}{t} \sum_{h=1}^{t} f_h^l$
7:    Pick $f_t^u$ as a Bernoulli variable where $P(f_t^u = G(x)) = p$, $P(f_t^u = H(x)) = 1 - p$
8:    $U_t := \frac{1}{t} \sum_{h=1}^{t} f_h^u$
9:    $F_t := \nu U_t + (1 - \nu) L_t$
10:   $a_t := \arg\min_{x \in \Omega} \langle F_t'(x_t), x \rangle$
11:   $x_{t+1} := x_t + \frac{a_t - x_t}{t}$
12: end for
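
Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the domain $\Omega$ is taken to be the unit simplex (so the linear minimization step reduces to picking the vertex with the smallest gradient coordinate), the caller supplies the gradients of $g$ and $h$, the loop stops after a fixed number of iterations `T`, and the current iterate is simply updated in place on every pass. The function name `gs_opt` and its arguments are hypothetical.

```python
import numpy as np


def gs_opt(grad_g, grad_h, x1, p=0.5, nu=0.5, T=100, seed=0):
    """Sketch of GS-OPT on the unit simplex (an assumed choice of Omega)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x1, dtype=float)
    # Counts of how often G was picked: f_1^l = G and f_1^u = H.
    a_l, a_u = 1, 0
    for t in range(2, T + 1):
        a_l += int(rng.random() < p)  # pick f_t^l
        a_u += int(rng.random() < p)  # pick f_t^u
        # F_t = nu * U_t + (1 - nu) * L_t is a weighted sum of G and H.
        w_G = (nu * a_u + (1 - nu) * a_l) / t
        w_H = 1.0 - w_G
        grad_F = w_G * grad_g(x) / p + w_H * grad_h(x) / (1 - p)
        # Linear minimization over the simplex: the best point is a vertex.
        a = np.zeros_like(x)
        a[np.argmin(grad_F)] = 1.0
        # Move towards that vertex with step size 1/t.
        x = x + (a - x) / t
    return x
```

Because every update is a convex combination of points of the simplex, the iterate stays feasible without any projection step, which is the main practical appeal of this Frank-Wolfe-style update.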


Theorem 1 (Convergence of the GS-OPT algorithm). Consider the objective function $f(x)$ in problem (2), the linear combination parameter $\nu \in (0, 1)$ and the Bernoulli parameter $p \in (0, 1)$. For GS-OPT, with probability one, $F_t(x)$ converges to $f(x)$ as $t \to +\infty$ for any $x \in \Omega$, and $x_t$ converges to a local minimum or stationary point of $f(x)$ at a rate of $O(1/t)$.

The proof of Theorem 1 is similar to that in [17]. Because the objective function $f(x)$ is non-convex, the choice of convergence criterion is important. For unconstrained problems, the gradient norm $\|\nabla f(x)\|$ is typically used to measure convergence, because $\|\nabla f(x)\| \to 0$ captures convergence to a stationary point. However, this criterion cannot be used for constrained problems. Instead, we use the "Frank-Wolfe gap" criterion [21].
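
To make the criterion concrete: for minimization over a compact convex set $\Omega$, the Frank-Wolfe gap at $x$ is commonly defined as $\max_{s \in \Omega} \langle \nabla f(x), x - s \rangle$, which is non-negative and equals zero exactly at stationary points. The sketch below evaluates it on the unit simplex, where the maximum is attained at a vertex. The choice of domain and the function name `frank_wolfe_gap_simplex` are assumptions for illustration, and [21] may state the criterion in a slightly different form.

```python
import numpy as np


def frank_wolfe_gap_simplex(grad, x):
    """Frank-Wolfe gap max_s <grad, x - s> over the unit simplex.

    The maximizing s is the vertex with the smallest gradient coordinate,
    so the gap equals <grad, x> - min_i grad_i.
    """
    grad = np.asarray(grad, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(grad @ x - grad.min())
```

A gap close to zero indicates that no feasible direction gives a first-order decrease, which plays the role that a vanishing gradient norm plays in the unconstrained case.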

Let $a_t$ and $b_t = t - a_t$ be the number of times that we have picked $G(x)$ and $H(x)$, respectively, after $t$ iterations when constructing the sequence $\{L_t\}$. We have $a_t \sim B(t, p)$, with $E(a_t) = tp$ and $D(a_t) = tp(1 - p)$. Let $S_t = a_t - tp$; then $S_t$ is asymptotically distributed as $N(0, tp(1 - p))$ as $t \to \infty$, so $S_t / t \to 0$ as $t \to \infty$ with probability 1. We have


$$L_t - f = \frac{S_t}{t}(G - H), \qquad L_t' - f' = \frac{S_t}{t}(G' - H')$$

Thus $L_t \to f$ as $t \to +\infty$ with probability 1. Similarly, $U_t \to f$ as $t \to +\infty$ with probability 1. In addition, $F_t = \nu U_t + (1 - \nu) L_t$ implies $F_t - f = \nu (U_t - f) + (1 - \nu)(L_t - f)$. Since $U_t$ and $L_t$ tend to $f(x)$ as $t \to +\infty$ with probability 1, we conclude that $F_t(x) \to f(x)$ as $t \to +\infty$ with probability 1. We demonstrate the efficiency of the GS-OPT algorithm experimentally in the next section, where we apply it to the posterior inference problem in topic models.
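
The identity for $L_t - f$ used in the argument above follows in a few lines from the definitions of $G$, $H$, $a_t$, $b_t$ and $S_t = a_t - tp$. A worked derivation, using only those symbols, is:

```latex
\begin{align*}
L_t &= \frac{1}{t}\sum_{h=1}^{t} f_h^l = \frac{a_t}{t}\,G + \frac{b_t}{t}\,H,
      \qquad b_t = t - a_t, \\
f   &= g + h = p\,G + (1 - p)\,H, \\
L_t - f &= \Big(\frac{a_t}{t} - p\Big) G + \Big(\frac{t - a_t}{t} - (1 - p)\Big) H \\
        &= \frac{a_t - tp}{t}\,G - \frac{a_t - tp}{t}\,H
         = \frac{S_t}{t}\,(G - H).
\end{align*}
```

The same computation applied to the derivatives gives the corresponding identity for $L_t' - f'$, and replacing the counts of the $L$ sequence with those of the $U$ sequence gives the analogous result for $U_t$.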


3. APPLYING GS-OPT FOR THE MAP PROBLEM IN TOPIC MODELS 

Latent Dirichlet allocation (LDA) [22] is a generative model for text and other discrete data. It assumes that a corpus is composed from $K$ topics $\beta = (\beta_1, \ldots, \beta_K)$, each of which is a sample from a $V$-dimensional Dirichlet distribution, Dirichlet($\eta$). Each document $d$ is a mixture of those topics and is assumed to arise from the following generative process: draw $\theta_d \mid \alpha \sim$ Dirichlet($\alpha$); then, for the $n$-th word of $d$, draw a topic index $z_{dn} \mid \theta_d \sim$ Multinomial($\theta_d$) and a word $w_{dn} \mid z_{dn}, \beta \sim$ Multinomial($\beta_{z_{dn}}$). Each topic mixture $\theta_d = (\theta_{d1}, \ldots, \theta_{dK})$ represents the contributions of topics to document $d$, while $\beta_{kj}$ shows the contribution of term $j$ to topic $k$. Note that $\theta_d \in \Delta_K$ and $\beta_k \in \Delta_V$ for all $k$. Both $\theta_d$ and $z_d$ are unobserved variables and are local to each document, as shown in Figure 1.


According to [23], the task of Bayesian inference (learning) given a corpus $C = \{d_1, \ldots, d_M\}$ is to estimate the posterior distribution $p(z, \theta, \beta \mid C, \alpha, \eta)$ over the latent topic indices $z = \{z_1, \ldots, z_M\}$, topic mixtures $\theta = \{\theta_1, \ldots, \theta_M\}$, and topics $\beta = (\beta_1, \ldots, \beta_K)$. The problem of posterior inference for each document $d$, given a model $\{\beta, \alpha\}$, is to estimate the full joint distribution $p(z_d, \theta_d, d \mid \beta, \alpha)$. Direct estimation of this distribution is intractable. Hence existing approaches use different schemes such as variational Bayes (VB) [22], collapsed variational Bayes (CVB) [23], CVB0 [24], and collapsed Gibbs sampling (CGS) [25, 26]. VB, CVB and CVB0 estimate the distribution by maximizing a lower bound of the likelihood $p(d \mid \beta, \alpha)$, whereas CGS estimates $p(z_d \mid d, \beta, \alpha)$. The efficiency of LDA in practice is determined by the efficiency of the inference method employed. However, none of the mentioned methods has a theoretical guarantee on quality and convergence rate.

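We consider the MAP estimation of the topic mixture for a given document $d$:


$$\theta^* = \arg\max_{\theta \in \overline{\Delta}_K} P(\theta, d \mid \beta, \alpha) = \arg\max_{\theta \in \overline{\Delta}_K} P(d \mid \theta, \beta) P(\theta \mid \alpha) \qquad (5)$$

Problem (5) is equivalent to the following:

$$\theta^* = \arg\max_{\theta \in \overline{\Delta}_K} \sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + (\alpha - 1) \sum_{k=1}^{K} \log \theta_k \qquad (6)$$


We rewrite (6) as

$$\theta^* = \arg\min_{\theta \in \overline{\Delta}_K} \left[ -\sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} - (\alpha - 1) \sum_{k=1}^{K} \log \theta_k \right] \qquad (7)$$


Problem (7) is non-convex when $\alpha < 1$, which is the usual setting in practice, and it is NP-hard [27]. We denote


$$g(\theta) = -\sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}, \qquad h(\theta) = (1 - \alpha) \sum_{k=1}^{K} \log \theta_k$$


so that the objective function of (7) is $f(\theta) = -\sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + (1 - \alpha) \sum_{k=1}^{K} \log \theta_k = g(\theta) + h(\theta)$. Problem (7) is therefore an instance of (2), and we can use the GS-OPT algorithm to solve it.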
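
The two components of the LDA objective and their gradients, which are the quantities a gradient-based solver needs, can be written down directly. The following sketch assumes NumPy arrays with `theta` of shape `(K,)`, `beta` of shape `(K, V)` with strictly positive entries, and `d` of shape `(V,)`. The function name `lda_map_parts` is hypothetical, and the gradients are obtained by differentiating $g$ and $h$ as defined above.

```python
import numpy as np


def lda_map_parts(d, beta, alpha):
    """Return (g, h, grad_g, grad_h) for the MAP problem of one document."""
    d = np.asarray(d, dtype=float)
    beta = np.asarray(beta, dtype=float)

    def g(theta):
        # Negative log-likelihood: -sum_j d_j log sum_k theta_k beta_kj
        return -np.sum(d * np.log(theta @ beta))

    def h(theta):
        # Negative log-prior term: (1 - alpha) sum_k log theta_k
        return (1 - alpha) * np.sum(np.log(theta))

    def grad_g(theta):
        # d g / d theta_k = -sum_j d_j beta_kj / (sum_k' theta_k' beta_k'j)
        return -beta @ (d / (theta @ beta))

    def grad_h(theta):
        # d h / d theta_k = (1 - alpha) / theta_k
        return (1 - alpha) / theta

    return g, h, grad_g, grad_h
```

Both gradients are only defined for `theta` in the interior of the simplex, since they divide by `theta` or by `theta @ beta`, so an implementation would normally start from a strictly positive point such as the uniform mixture.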


Figure 1. Latent Dirichlet allocation model


GS-OPT has several attractive properties that other methods lack. We further show how simple it is to use GS-OPT to design fast learning algorithms for topic models. More specifically, building on two learning algorithms for LDA, ML-OPE and Online-OPE [17], we design two algorithms: ML-GSOPT, which enables us to learn LDA from either large corpora or data streams, and Online-GSOPT, which learns LDA from large corpora in an online fashion. These algorithms employ GS-OPT to do MAP inference for individual documents, and the online or streaming scheme to infer the global variables (topics). Details of ML-GSOPT and Online-GSOPT are presented in Algorithm 2 and Algorithm 3.


Algorithm 2 ML-GSOPT for learning LDA from massive/streaming data
Input: data sequence, $K$, $\alpha$, $\tau > 0$, $\kappa \in (0.5, 1]$
Output: $\beta$
1: Initialize $\beta^0$ randomly in $\Delta_V$
2: for $t = 1, 2, \ldots, \infty$ do
3:    Pick a set $C_t$ of documents
4:    Do inference by GS-OPT for each $d \in C_t$ to get $\theta_d$, given $\beta^{t-1}$
5:    Compute intermediate topic $\hat{\beta}^t$ as $\hat{\beta}^t_{kj} \propto \sum_{d \in C_t} d_j \theta_{dk}$
6:    Set step-size: $\rho_t = (t + \tau)^{-\kappa}$
7:    Update topics: $\beta^t := (1 - \rho_t) \beta^{t-1} + \rho_t \hat{\beta}^t$
8: end for
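
The intermediate-topic, step-size and update steps of Algorithm 2 can be sketched as a single function. The sketch assumes that the topic mixtures for the mini-batch have already been inferred (for example by GS-OPT) and are passed in as an `(S, K)` array, that documents are given as an `(S, V)` array of term frequencies, and that the proportionality sign is resolved by normalizing each topic to sum to one. The name `ml_gsopt_update` and its arguments are hypothetical.

```python
import numpy as np


def ml_gsopt_update(beta_prev, docs, thetas, t, tau=1.0, kappa=0.9):
    """One ML-GSOPT topic update for mini-batch t (sketch)."""
    docs = np.asarray(docs, dtype=float)      # shape (S, V): d_j per document
    thetas = np.asarray(thetas, dtype=float)  # shape (S, K): theta_dk
    # Intermediate topics: beta_hat_kj proportional to sum_d d_j * theta_dk.
    beta_hat = thetas.T @ docs
    beta_hat /= beta_hat.sum(axis=1, keepdims=True)
    # Step size rho_t = (t + tau)^(-kappa).
    rho = (t + tau) ** (-kappa)
    # Convex combination of the previous topics and the intermediate topics.
    return (1 - rho) * beta_prev + rho * beta_hat
```

Since both `beta_prev` and `beta_hat` have rows on the simplex, the updated topics remain valid distributions over the vocabulary.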
Algorithm 3 Online-GSOPT for learning LDA from massive data
Input: Training data $C$ with $D$ documents, $K$, $\alpha$, $\eta$, $\tau > 0$, $\kappa \in (0.5, 1]$
Output: $\lambda$
1: Initialize $\lambda^0$ randomly
2: for $t = 1, 2, \ldots, \infty$ do
3:    Sample a set $C_t$ consisting of $S$ documents.
4:    Use GS-OPT to do posterior inference for each document $d \in C_t$, given the global variable $\beta^{t-1} \propto \lambda^{t-1}$ in the last step, to get topic mixture $\theta_d$. Then, compute $\phi_d$ as $\phi_{djk} \propto \theta_{dk} \beta_{kj}$
5:    For each $k \in \{1, 2, \ldots, K\}$, form an intermediate global variable $\hat{\lambda}_k$ for $C_t$ by $\hat{\lambda}_{kj} = \eta + \frac{D}{S} \sum_{d \in C_t} d_j \phi_{djk}$
6:    Update the global variable by $\lambda^t := (1 - \rho_t) \lambda^{t-1} + \rho_t \hat{\lambda}$ where $\rho_t = (t + \tau)^{-\kappa}$
7: end for
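
The global update of Algorithm 3 can be sketched in the same style. As before, the topic mixtures for the mini-batch are assumed to be already available as an `(S, K)` array, and documents as an `(S, V)` array of term frequencies. The sketch further assumes that $\phi_{djk}$ is normalized over topics $k$ for each document and term, which is one natural reading of the proportionality sign but is not stated explicitly in the algorithm. The name `online_gsopt_update` and its arguments are hypothetical.

```python
import numpy as np


def online_gsopt_update(lam_prev, docs, thetas, t, D, eta, tau=1.0, kappa=0.9):
    """One Online-GSOPT update of the global variable lambda (sketch)."""
    docs = np.asarray(docs, dtype=float)      # shape (S, V)
    thetas = np.asarray(thetas, dtype=float)  # shape (S, K)
    S = docs.shape[0]
    # Topics from the previous global variable: beta proportional to lambda.
    beta = lam_prev / lam_prev.sum(axis=1, keepdims=True)
    # phi_djk proportional to theta_dk * beta_kj, normalized over k (assumption).
    phi = thetas[:, :, None] * beta[None, :, :]  # shape (S, K, V)
    phi /= phi.sum(axis=1, keepdims=True)
    # Intermediate global variable: eta + (D / S) * sum_d d_j * phi_djk.
    lam_hat = eta + (D / S) * np.einsum("dv,dkv->kv", docs, phi)
    rho = (t + tau) ** (-kappa)
    return (1 - rho) * lam_prev + rho * lam_hat
```

The factor `D / S` rescales the mini-batch statistics as if the whole corpus of `D` documents had been observed, which is what allows the update to work from one small sample at a time.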


4. EMPIRICAL EVALUATION 

This section investigates the practical behavior of GS-OPT and how useful it is when employed to design two new algorithms for learning topic models at large scale. To this end, we consider the following methods, datasets, and performance measures.


4.1. Datasets: 

We used two large corpora: the PubMed dataset, consisting of 330,000 articles from PubMed Central, and the New York Times dataset, consisting of 300,000 news articles. The datasets were taken from http://archive.ics.uci.edu/ml/datasets. For each dataset, we use 10,000 documents as the test set.


4.2. Parameter settings: 
To compare our methods fairly with others, almost all free parameters are set as in [17].
a. Model parameters: The number of topics is $K = 100$, the hyper-parameter $\alpha = 1/K$, and the topic Dirichlet parameter $\eta = 1/K$. These parameters are commonly used in topic models.
b. Inference parameters: The number of iterations is chosen as $T = 50$.
c. Learning parameters: $\kappa = 0.9$ and $\tau = 1$, which worked best for existing inference methods.


We run experiments in two scenarios: (1) Bernoulli parameter $p \in \{0.30, 0.35, \ldots, 0.65, 0.70\}$ with mini-batch size $|C_t| = 25,000$; (2) Bernoulli parameter $p \in \{0.1, 0.2, \ldots, 0.9\}$ with mini-batch size $|C_t| = 5,000$. In the experiments with GS-OPT, we choose the linear combination parameter $\nu = 0.3$ on New York Times and $\nu = 0.1$ on PubMed. Further experiments could examine the effect of the parameter $\nu \in (0, 1)$ in GS-OPT.


4.3. Performance measures: 

We used Log Predictive Probability (LPP) and Normalized Pointwise Mutual Information (NPMI) to evaluate the learning methods. NPMI [28] evaluates the semantic quality of an individual topic. From extensive experiments, [28] found that NPMI agrees well with human evaluation of the interpretability of topic models. Predictive probability [26] measures the predictability and generalization of a model on new data.
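
As a rough illustration of how a topic-level NPMI score can be computed, the sketch below averages the standard pairwise quantity $\mathrm{NPMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i) P(w_j)} / (-\log P(w_i, w_j))$ over all pairs of a topic's top words, estimating probabilities from document-level co-occurrence. The reference corpus, the number of top words, and the handling of the boundary cases are assumptions for this sketch and may differ from the exact protocol of [28]; the function name `topic_npmi` is hypothetical.

```python
import math
from itertools import combinations


def topic_npmi(top_words, documents):
    """Average pairwise NPMI of a topic's top words (at least two words)."""
    doc_sets = [set(doc) for doc in documents]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n

    scores = []
    for wi, wj in combinations(top_words, 2):
        p_i, p_j, p_ij = prob(wi), prob(wj), prob(wi, wj)
        if p_ij == 0:
            scores.append(-1.0)  # convention: the words never co-occur
        elif p_ij == 1:
            scores.append(1.0)   # convention: avoids dividing by -log(1) = 0
        else:
            scores.append(math.log(p_ij / (p_i * p_j)) / -math.log(p_ij))
    return sum(scores) / len(scores)
```

Scores lie in $[-1, 1]$: values near 1 mean the top words almost always appear together, values near 0 mean they co-occur about as often as independent words would, and $-1$ means they never co-occur.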


4.4. Evaluation results:
4.4.1. Inference methods: 

Variational Bayes (VB) [22], collapsed variational Bayes (CVB, CVB0) [24], collapsed Gibbs sampling (CGS) [26], OPE [17], and GS-OPT. CVB0 and CGS have been observed to work best in several previous studies [24, 26]. Therefore, they can be considered state-of-the-art inference methods.


4.4.2. Large-scale learning methods: 

ML-GSOPT, Online-GSOPT, ML-OPE, Online-OPE [17], Online-CGS [26], Online-CVB [24], Online-VB [29].

To reduce the effect of randomness, each learning method is run five times on each dataset and the average results are reported. By changing the variables and bound functions, we obtain GS-OPT, which is more effective than OPE. GS-OPT has the parameter $\nu \in (0, 1)$ for constructing the linear combination $F_t$ of $U_t$ and $L_t$, so the quality of its results depends on how $\nu$ is chosen. We run experiments on learning LDA with the GS-OPT algorithm on the two datasets, choosing the Bernoulli parameter $p \in \{0.30, 0.35, \ldots, 0.65, 0.70\}$ with mini-batch size $|C_t| = 25,000$. [17] shows that OPE is better than previous methods. Thus, we compare our method with OPE using the LPP and NPMI measures on the two datasets. Details of these experimental results are shown in Figure 2.


Figure 2. Results of GS-OPT with different values of $p$ with mini-batch size $|C_t| = 25,000$, on New York Times and on Pubmed. Higher is better. (a) ML-GSOPT, (b) Online-GSOPT


Our experiments show that using the Bernoulli distribution and two bounds of the objective function in GS-OPT can give better results than OPE. We also see that the performance of GS-OPT depends on the chosen Bernoulli parameter $p$. We ran further experiments by dividing the data into smaller mini-batches with $|C_t| = 5,000$ and choosing the Bernoulli parameter from a wider range, $p \in \{0.1, 0.2, \ldots, 0.8, 0.9\}$, with $\nu = 0.3$ on the New York Times dataset and $\nu = 0.1$ on the Pubmed dataset. Details of these experimental results are shown in Figure 3.

The results of GS-OPT depend on the chosen Bernoulli parameter $p$ and linear combination parameter $\nu$. We also find that mini-batch size $|C_t| = 5,000$ gives better results than $|C_t| = 25,000$: both LPP and NPMI are higher with $|C_t| = 5,000$ than with $|C_t| = 25,000$. In addition, Online-GSOPT is better than Online-VB, Online-CVB and Online-CGS on both datasets in terms of LPP and NPMI. Details of these results are shown in Figure 4. This illustrates the contribution of the prior and the likelihood to solving the inference problem.
Figure 3. Results of GS-OPT with different values of $p$ with mini-batch size $|C_t| = 5,000$, on New York Times and on Pubmed. Higher is better. (a) ML-GSOPT, (b) Online-GSOPT
Figure 4. Performance of different learning methods as they see more documents. Higher is better. Online-GSOPT is better than Online-OPE, Online-VB, Online-CVB and Online-CGS. We choose mini-batch size $|C_t| = 5,000$.


Int J Artif Intell ISSN: 2252-8938 0 191 


5. CONCLUSION 


In this paper, we propose GS-OPT, a new algorithm for efficiently solving non-convex optimization problems. Using the Bernoulli distribution and stochastic approximations, GS-OPT deals well with the posterior inference problem in topic models. The Bernoulli parameter $p$ in GS-OPT can be seen as a regularization parameter that helps the model be more efficient and avoid over-fitting. By exploiting GS-OPT carefully in topic models, we have arrived at two efficient methods for learning LDA from large corpora. As a result, they are good candidates for dealing with text streams and big data. In addition, GS-OPT is flexible, so it can be applied to many other non-convex problems in machine learning.